Hierarchical link clustering

Time complexity

It involves the following steps:

Similarity calculation
Hierarchical clustering

Similarities between edges are calculated by visiting each node ( $i$ ) and measuring the Jaccard coefficient of each edge pair. The time complexity of each operation (Jaccard) is $\max(k_1, k_2)$ , where $k_1$ and $k_2$ are the degrees of the two nodes that are connected to $i$ . The worst case is a hub that is connected to everyone: $O(N)$ . If the degree distribution is bounded or if we ignore the pairs with at least one node with more than a certain degree, then the complexity is $O(1)$ .

Then, we need ${k_i \choose 2}$ of this operation for each node $i$ . There are $N p(k)$ nodes that have a degree $k$ . For every node, the number of operations is $N \sum_k p(k) {k \choose 2 }.$ Without hubs (e.g. Poisson distribution), the time complexity is $O(N)$ as $\sum_k p(k) {k \choose 2 }$ converges to a value that does not scale as $N$ . If $p(k)\sim k^{-3}$ , the sum scales logarithmically, and thus the time complexity becomes $O(N \log N)$ . For a general case with $p(k) \sim k^{-\alpha}$ ( $2 < \alpha < 3$ ), using $k_{\text{max}} \sim N^{1/(\alpha - 1)}$ (see this), we can calculate the sum scales as $N^{\frac{3-\alpha}{\alpha-1}}$ . The time complexity is therefore $N^{\frac{2}{\alpha - 1}}$ .

The whole process of the similarity calculation requires $\begin{cases} O(N) & \text{If the second moment of the degree distribution does not scale with} ~ N \\ O(N \log N) & \text{If}~ \alpha = 3 ~\text{and the Jaccard calculation is bounded} \\ O(N^{\frac{2}{\alpha - 1}}) & \text{If} ~2 < \alpha < 3~ \text{and the Jaccard calculation is bounded} \\ O(N^4) & \text{If the network is fully connected and if we calculate all similarities} \end{cases}$

Once we have the similarities of edge pairs, we can run the full hierarchical clustering or sample several thresholds to identify the optimal threshold. If we do the full clustering, we would need $O(M^2)$ . If we use the sampling approach, (need more analysis)…

So the total time complexity is $O(\text{similarity}) + O(\text{hierarchical clustering})$ and it’ll depend on the types of networks. For instance, if the network is a scale-free network with exponent 3, if we set a certain degree threshold for calculating the similarity, and if the number of edges is $O(N \log N)$ or smaller, then the total time complexity will be $O(N \log N)$ .